Part I

Slide 1

Bioinformatics vs statistics vs computational biology

Things bioinformaticians care about

  • Experimental design
  • The biological question
  • Statistics
  • Reproducibility
  • File formats

Experimental design

How many samples are sufficient?

  • Depends on the question
  • Depends on the technology
  • Depends on the variability
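To make "depends on the variability" concrete, here is a rough sketch of a sample-size estimate for comparing two group means. It uses the textbook normal approximation, not the method of any particular study; the function name `samples_per_group` and the defaults (5% significance, 80% power) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def samples_per_group(effect, sd, alpha=0.05, power=0.8):
    """Approximate n per group for a two-sided, two-sample comparison of means.

    effect: smallest difference in means worth detecting (assumed known)
    sd:     standard deviation within each group (assumed equal in both)
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    n = 2 * ((z_alpha + z_beta) * sd / effect) ** 2
    return math.ceil(n)


# Doubling the variability roughly quadruples the required sample size.
print(samples_per_group(effect=1.0, sd=1.0))
print(samples_per_group(effect=1.0, sd=2.0))
```

A real design needs more than this (the technology, the test actually used, dropouts), which is one more reason to have the conversation early.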

The bottom line

Talk to your bioinformatician early!

Statistics

Statistics matters

What is a p-value?

\(H_0\): The null hypothesis, no effect

\(H_1\): The alternative hypothesis, there is an effect

We run a test, we get a p-value. What is it?

  • Probability that \(H_0\) is true, given the data
  • Probability that \(H_1\) is wrong, given the data
  • Probability that the data is random
  • Probability of observing the data (or data at least as extreme), given \(H_0\) is true
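One way to see what a p-value measures is to simulate the null hypothesis directly. The sketch below is a simple permutation test on the difference in means; it is an illustration under the assumption that group labels are exchangeable under \(H_0\), and the function name and defaults are made up for this example.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_perm=10_000, seed=1):
    """Share of label shufflings with a mean difference at least as large as observed."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # under H0 the group labels carry no information
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point noise
            hits += 1
    # Count the observed labelling itself so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


p = permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15])
print(f"p = {p:.4f}")
```

Note what the loop computes: how often data like ours shows up when \(H_0\) holds. It never estimates how likely \(H_0\) itself is.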

Our intuition is Bayesian, not frequentist

| Frequentist Statistics | Bayesian Statistics |
|---|---|
| 1. Probability is defined as the long-run frequency of events. | 1. Probability represents a degree of belief or certainty about an event. |
| 2. Parameters (like the “true value”) are fixed but unknown quantities. | 2. Parameters are treated as random variables with their own probability distributions. |
| 3. Asking about the probability of a hypothesis does not make sense. | 3. Asking about the probability of a hypothesis is the main goal. |
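The gap between the two views can be shown with Bayes' theorem. The sketch below asks the Bayesian question ("how likely is \(H_0\), given a significant result?") and needs two inputs a p-value alone does not provide: a prior probability that the effect is real, and the power of the test. The numbers used here are hypothetical, picked only for illustration.

```python
def prob_null_given_significant(prior_h1, power, alpha):
    """P(H0 is true | test came out significant), by Bayes' theorem."""
    false_positives = alpha * (1 - prior_h1)  # H0 true, test significant anyway
    true_positives = power * prior_h1         # H1 true, test detects it
    return false_positives / (false_positives + true_positives)


# Hypothetical scenario: 1 in 10 tested effects is real, 80% power, alpha = 0.05.
risk = prob_null_given_significant(prior_h1=0.1, power=0.8, alpha=0.05)
print(f"P(H0 | significant) = {risk:.2f}")  # prints 0.36, not 0.05
```

Under these assumed inputs, more than a third of "significant" results would come from true nulls, even though every one of them has p < 0.05.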

Going beyond the p-value

  • Confidence intervals
  • Effect sizes
  • Power analysis
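As a minimal sketch of the first two items, the code below computes Cohen's d (a standardized effect size) and an approximate confidence interval for the difference in means. The interval uses a normal approximation for simplicity; with small samples a t-based interval would be wider. Function names and the example numbers are assumptions made for illustration.

```python
import math
from statistics import NormalDist, mean, stdev


def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * stdev(a) ** 2 + (n_b - 1) * stdev(b) ** 2) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def mean_diff_ci(a, b, level=0.95):
    """Approximate confidence interval for mean(a) - mean(b) (normal approximation)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


treated = [2, 4, 6]   # made-up measurements
control = [1, 3, 5]
print(cohens_d(treated, control))      # 0.5
print(mean_diff_ci(treated, control))  # a wide interval that includes zero
```

Reporting both numbers tells the reader how big the effect is and how uncertain the estimate is, which a p-value alone does not.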

Why is that important?

P-values are the language of science, whether we like them (we don’t) or not.

  • Always report effect sizes
  • Never rely on p-values alone

Tip

You have to understand p-values and their limits to talk to other scientists!

Reproducibility

Tale of two papers

Lessons learned

  • A lot depends on how you analyze your data
  • This in turn depends on the questions you ask
  • The average “Methods” section is not sufficient for reproducible science!

Reproducible workflows with Rmarkdown

flowchart LR
    A(Program + Text) -->|knitr| B(Text with\nanalysis results)
    B --> C[LaTeX]
    C --> CC[PDF]
    B --> D[Word]
    B --> E[HTML]
    B --> F[Presentation]
    B --> G[Book]

This can be Rmarkdown, Quarto, Jupyter… the goal is that your code and your text are in one place, and the results of your calculations are entered automatically into the text.

Reproducible workflows with Rmarkdown

In systems such as R Markdown, you can put your analysis results directly in your text. For example, when I write that the \(p\)-value is equal to 0.05, I am writing this:

In systems such as R Markdown, you can put your analysis results
directly in your text. For example, when I write that the
$p$-value is equal to `r p`, I am writing this:

The \(p\)-value above is not entered manually (as 0.05), but is the result of a statistical computation. If the data changes, if your analysis changes, the \(p\)-value above will automatically change as well.
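The same idea can be sketched in plain Python, for example inside a Jupyter notebook: compute the value, then let the sentence pull it in. The data (9 of 10 samples going up) and the choice of an exact two-sided sign test are made up for this illustration.

```python
from math import comb


def sign_test_p(n_up, n_total):
    """Exact two-sided sign test: p-value for n_up 'increases' out of n_total under a 50/50 null."""
    k = max(n_up, n_total - n_up)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2 * tail)


p = sign_test_p(n_up=9, n_total=10)  # hypothetical data

# The number in the sentence is never typed by hand.
sentence = f"The p-value is equal to {p:.3f}."
print(sentence)
```

If the counts change, rerunning the code updates the sentence, which is the whole point of keeping code and text in one place.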

File formats and data management

How we work

flowchart LR
    A(Excel) --> B(Data import)
    AA(CSV, TSV) --> B(Data import)
    AAA(fastq, ...) --> B(Data import)
    B --> C[Data\ncleanup]
    C --> D[Long term storage]
    C --> E[Analysis]
    E --> D
    E --> F(Figures)
    E --> G(Manuscript\nfragments)
    E --> H(Tables\nExcel files)
    F --> I[You]
    G --> I
    H --> I
    I --> E

In the diagram above, two things usually take the most hands-on time:

  • Data cleanup
  • Fine-tuning the analysis results

Excel and gene names

  • Excel converts some words to dates automatically
  • Gene names like MARCH1 are converted to dates
  • In most cases, you can’t switch off this behavior
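A quick check for this problem can be sketched as follows. It assumes the mangled names show up in common date-like forms such as `1-Mar` or `Sep-02`; other locales and formats exist, so treat the pattern as a starting point. The names `DATE_LIKE` and `find_date_like` are made up for this example.

```python
import re

MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
# Matches "1-Mar" style or "Mar-01" style strings, case-insensitively.
DATE_LIKE = re.compile(rf"^\d{{1,2}}-({MONTHS})$|^({MONTHS})-\d{{1,2}}$", re.IGNORECASE)


def find_date_like(names):
    """Return the entries of a gene-name column that look like dates."""
    return [name for name in names if DATE_LIKE.match(name.strip())]


genes = ["TP53", "1-Mar", "Sep-02", "MARCH1"]
print(find_date_like(genes))  # ['1-Mar', 'Sep-02']
```

Running a check like this right after data import catches the damage before it propagates into the analysis.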

How (not to) work with Excel

Three reasons why you should follow these rules:

  1. Fewer chances of errors
  2. Your bioinformaticians will love you
  3. The analysis will be done much faster

How (not to) work with Excel

Avoid manually changing Excel files

  • Manual changes cannot be tracked automatically
  • You have to record every change you make
  • Otherwise, this is not reproducible science!

How (not to) work with Excel

Never use formatting for data

Never encode information as formatting; always use explicit columns

Color / font size / font style cannot be read automatically

How (not to) work with Excel

Don’t combine values and comments

Make a separate column for comments

Otherwise the values might be lost

How (not to) work with Excel

Don’t put meta-information into column names

Make a separate Excel sheet for column meta-information

How (not to) work with Excel

(for your reference)

  • Avoid manually changing Excel files
  • Never use formatting for data
  • Don’t combine values and comments
  • Don’t put meta-information into column names
  • One sheet = one table
  • Header = one line
  • Do not use merged cells
  • Use consistent file names
  • Avoid spaces in file and column names (use underscores)
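Some of these rules can be checked automatically before a file is handed over. The sketch below looks only at a table header, flagging spaces, empty names, and duplicates; it assumes the header has already been read into a list of strings, and both function names are hypothetical.

```python
import re


def check_column_names(columns):
    """Return a list of human-readable problems found in a table header."""
    problems = []
    seen = set()
    for col in columns:
        if col.strip() == "":
            problems.append("empty column name")
            continue
        if re.search(r"\s", col):
            problems.append(f"whitespace in column name: {col!r}")
        if col in seen:
            problems.append(f"duplicate column name: {col!r}")
        seen.add(col)
    return problems


def suggest_name(col):
    """Trim the name and replace runs of whitespace with a single underscore."""
    return re.sub(r"\s+", "_", col.strip())


print(check_column_names(["sample id", "", "age", "age"]))
print(suggest_name(" body  weight "))  # body_weight
```

Rules about formatting, merged cells, and comments mixed into values cannot be verified from the header alone and still need a human eye.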

Some more tips and summaries

Things we don’t like

  • Cleaning up data
  • Data dredging
  • P-hacking
  • Post-hoc hypotheses
  • Excel
  • Manual changes like changing fonts in figures
  • Non-reproducible science

Things we love

  • Clear questions
  • A priori hypotheses
  • Challenging statistics
  • Creating new tools
  • R and Rmarkdown, or
  • Python and Jupyter
  • Reproducible workflows
  • Well organized data

Things that you should probably learn

  • Learn how to code (preferably R or Python)
  • Learn reproducible workflows with Rmarkdown or Jupyter

Thank you

You can find this presentation along with its source code at https://github.com/bihealth/howtotalk

Slide 1

To compile, type `quarto render template.qmd`

Make sure you have Quarto 1.2 installed.

Multicolumn slide

Left column title

Left column…

Right column title

Right column (60%)…

(adding .fragment causes the contents to be displayed in steps)

Part II separator slide

Simple numbered and unnumbered lists

  • One
  • Two
  1. One
  2. Two

Incremental list

  • Item 1
  • Item 2
  • Item 3

Incremental contents

First part

Second part

This is a slide without a title (use the dashes to separate)

Transitions

Define them in the YAML header or like here, in the slide title.

Types: none, fade, slide, convex, concave, zoom

Code

plot(1:10)
Figure 1: A dumb plot

Tip

Ctrl-click on the image to zoom. And here is a 3.1415927 for you.

Code

There are many customization options for the code. For example, you can highlight (and even animate) certain lines of code:

a <- rnorm(10)
b <- rnorm(10) + a
c <- a + b * rnorm(10)

You can also specify precisely where the output of the code should go: below the code (default), on the next slide, on a right-hand column…